Add M nearest-neighbour Chatterjee correlation (#990) #1414

Merged: mborland merged 1 commit into boostorg:develop from su-senka:feature/chatterjee-mnn on Jul 2, 2026

@su-senka su-senka commented Jul 1, 2026

Summary

Implements the revised (M nearest-neighbour) Chatterjee rank correlation of
Lin & Han (arXiv 2021; Biometrika 2023), addressing #990. Adds a new function
chatterjee_correlation_mnn(u, v, M) alongside the existing
chatterjee_correlation, with the same C++11 and C++17 overload structure.

The original coefficient has a detection boundary of n^(-1/4) for independence
testing, well short of the parametric n^(-1/2) rate. By using the M right
nearest neighbours of each point (rather than the single right neighbour) and
letting M grow with n, the revised statistic consistently estimates the same
dependence measure while attaining near-parametric efficiency. See Lin & Han,
"On boosting the power of Chatterjee's rank correlation", Biometrika 110(2)
(2023) 283–299, arXiv:2108.06828.
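
For reference, the revised statistic can be written as below. This is a sketch recalled from Lin & Han rather than a quotation, so it should be checked against the paper before being relied on. Here R_i is the 1-based rank of Y_i, and j_m(i) is the index of the m-th right nearest neighbour of X_i, taken to be i itself when fewer than m points lie to the right of X_i.

$$
\xi_{n,M} = -2 + \frac{6 \sum_{i=1}^{n} \sum_{m=1}^{M} \min\{R_i,\, R_{j_m(i)}\}}{(n+1)\left[\, nM + M(M+1)/4 \,\right]}
$$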

Design notes

  • Separate function rather than an extended signature. The M-NN statistic
    uses min(R_i, R_j) and a different normalisation, so even at M = 1 it is not
    identical to chatterjee_correlation. A distinct function avoids silently
    changing existing results and keeps the statistical intent explicit.
    M is a required argument with no default.

  • Rank base. The internal rank() returns 0-based ranks; the paper's
    formula uses 1-based ranks. The offset cancels in the existing
    chatterjee_correlation statistic (which uses |R_i - R_{i+1}|) but not under
    min(.,.), so the +1 offset is applied explicitly. This is noted in a comment
    where it matters.

  • Complexity. O(n log n + nM). Near-linear for small M; tends to O(n²) as
    M → n.

  • Parallel path. The outer index loop is partitioned across threads into
    disjoint ranges, each reading the shared rank vector read-only (indices up to
    i + M may fall in a neighbouring range; there are no writes). This differs
    from the parallel path of the existing chatterjee_correlation, which splits
    the data array for the difference-based transform.

  • Ties / degenerate input. Like chatterjee_correlation, the function
    assumes distinct Y (continuous data). A constant Y returns a quiet NaN; this
    is detected on the input directly, since rank() collapses tied values.

  • Choice of M. The asymptotic null variance is minimised at M ~ sqrt(n); the
    choice is documented but left to the caller.
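
The notes above on rank base, complexity, and constant-Y handling can be illustrated with a minimal sequential sketch. It is not the code added in this PR: the name mnn_xi_sketch is hypothetical, the formula is the one assumed from Lin & Han (sum of min(R_i, R_j) over the M right neighbours, normalised by (n+1)[nM + M(M+1)/4]), and the input is assumed to be already sorted by x with distinct values so that the m-th right neighbour of index i is simply i + m.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Illustrative reference only (hypothetical name, not the Boost.Math API).
// Assumption: x is sorted ascending with distinct values, so the m-th right
// nearest neighbour of index i is i + m, or i itself when i + m >= n.
template <typename Real>
Real mnn_xi_sketch(const std::vector<Real>& x, const std::vector<Real>& y, std::size_t M)
{
    assert(x.size() == y.size());
    assert(std::is_sorted(x.begin(), x.end()));
    const std::size_t n = y.size();

    // Constant Y is detected on the raw input, before any ranking.
    if (std::all_of(y.begin(), y.end(), [&](Real v) { return v == y.front(); }))
    {
        return std::numeric_limits<Real>::quiet_NaN();
    }

    // 1-based ranks of y via an index sort: O(n log n).
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return y[a] < y[b]; });
    std::vector<std::size_t> rank(n);
    for (std::size_t k = 0; k < n; ++k)
    {
        rank[order[k]] = k + 1; // +1 matters: the offset does not cancel under min
    }

    // Double loop over the M right neighbours: O(nM).
    unsigned long long sum = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
        for (std::size_t m = 1; m <= M; ++m)
        {
            const std::size_t j = (i + m < n) ? i + m : i;
            sum += std::min(rank[i], rank[j]);
        }
    }

    const Real nr = static_cast<Real>(n);
    const Real Mr = static_cast<Real>(M);
    const Real denom = (nr + 1) * (nr * Mr + Mr * (Mr + 1) / 4);
    return Real(-2) + Real(6) * static_cast<Real>(sum) / denom;
}
```

Under the assumed formula, x = {1, 2, 3, 4, 5} with y = {10, 20, 30, 40, 50} and M = 2 gives -2 + 180/69 ≈ 0.6087, which agrees with the value -2 + 3n/(n + (M+1)/4) that the same formula yields for any strictly increasing relationship.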

Tests

Added to test_chatterjee_correlation.cpp, covering float, double, and
long double:

  • Exact closed-form checks against the paper's Remark 2.5 (strictly increasing
    and strictly decreasing dependence), which require no external reference.
  • Small exact spot values computed independently as rationals.
  • Constant-Y → NaN, and invariance under strictly increasing transforms of X
    and Y.
  • Sequential/parallel agreement across several M (under the parallel build).
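
The sequential/parallel agreement check can be pictured with the following sketch. It is an illustration under stated assumptions, not the PR's test code: both function names are hypothetical, the ranks are assumed to be 1-based and already ordered by X, and the chunks are processed in a plain loop rather than on real threads. The point is only that disjoint index ranges, each reading up to M entries past its own end, add up to exactly the same integer sum as a single pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sum of min(R_i, R_j) for i in [begin, end), where j
// runs over the M right neighbours of i (j = i when i + m falls off the end).
// `ranks` holds 1-based ranks of Y, ordered by X. It is only ever read.
unsigned long long min_rank_sum(const std::vector<std::size_t>& ranks, std::size_t M,
                                std::size_t begin, std::size_t end)
{
    assert(begin <= end && end <= ranks.size());
    const std::size_t n = ranks.size();
    unsigned long long sum = 0;
    for (std::size_t i = begin; i < end; ++i)
    {
        for (std::size_t m = 1; m <= M; ++m)
        {
            // i + m may lie in a neighbouring chunk; that is a read, never a write.
            const std::size_t j = (i + m < n) ? i + m : i;
            sum += std::min(ranks[i], ranks[j]);
        }
    }
    return sum;
}

// Hypothetical driver: split [0, n) into `chunks` disjoint ranges and add the
// partial sums. In a parallel build each range could go to its own thread.
unsigned long long min_rank_sum_chunked(const std::vector<std::size_t>& ranks, std::size_t M,
                                        std::size_t chunks)
{
    assert(chunks >= 1);
    const std::size_t n = ranks.size();
    unsigned long long total = 0;
    for (std::size_t c = 0; c < chunks; ++c)
    {
        const std::size_t begin = n * c / chunks;
        const std::size_t end = n * (c + 1) / chunks;
        total += min_rank_sum(ranks, M, begin, end);
    }
    return total;
}
```

Because the partial sums are integers, the chunked and single-pass results agree exactly, so a comparison of this kind needs no floating-point tolerance until the final normalisation.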

The sequential path was verified locally under b2 with cxxstd=14 and
cxxstd=17 (clang, arm64).

mborland (Member) left a comment

I've approved the workflow. This all looks good to me! @NAThompson if you have time could you give this a quick look?

codecov Bot commented Jul 2, 2026

Codecov Report

❌ Patch coverage is 99.30070% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 95.40%. Comparing base (0cfea2f) to head (2709fb2).
⚠️ Report is 8 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...e/boost/math/statistics/chatterjee_correlation.hpp | 98.36% | 1 Missing ⚠️ |

@@             Coverage Diff             @@
##           develop    #1414      +/-   ##
===========================================
+ Coverage    95.39%   95.40%   +0.01%     
===========================================
  Files          826      827       +1     
  Lines        68919    69062     +143     
===========================================
+ Hits         65747    65891     +144     
+ Misses        3172     3171       -1     

| Files with missing lines | Coverage Δ |
|---|---|
| test/test_chatterjee_correlation.cpp | 100.00% <100.00%> (ø) |
| ...e/boost/math/statistics/chatterjee_correlation.hpp | 97.05% <98.36%> (+1.93%) ⬆️ |

... and 3 files with indirect coverage changes

@mborland mborland linked an issue Jul 2, 2026 that may be closed by this pull request

mborland (Member) commented Jul 2, 2026

macOS failure has already been fixed on develop. Merging. Thank you for this contribution!

@mborland mborland merged commit 8ee12a5 into boostorg:develop Jul 2, 2026
73 of 74 checks passed

Linked issue: Can Chatterjee Correlation be improved?